Add respectNodePodLimits scheduler flag to enforce per-node pod capacity by dejanzele · Pull Request #4841 · armadaproject/armada

dejanzele · 2026-04-16T11:31:39Z

Summary

Adds a scheduling.respectNodePodLimits feature flag (default false) that enables the scheduler to track pods as a resource and to reject scheduling onto nodes that have exhausted their pod limit (node.Status.Allocatable["pods"]).
When enabled, the scheduler programmatically registers pods in supportedResourceTypes and indexedResources at startup, and injects pods: 1 into every job's internal resource requirements
The executor now always reports non-Armada pod count in NonArmadaAllocatedResources so the scheduler can subtract system/DaemonSet pods from available capacity
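
The accounting described in the summary can be sketched in a few lines of Go. This is a simplified illustration, not the code from this PR: `ResourceList`, `withPodSlot`, and `fits` are hypothetical names, and quantities are plain integers rather than the scheduler's internal resource types.

```go
package main

import "fmt"

// ResourceList is a hypothetical stand-in for the scheduler's internal
// resource representation: resource name -> integer quantity.
type ResourceList map[string]int64

// withPodSlot returns a copy of the job's requirements with pods: 1 added
// when the flag is on. The input map is never mutated.
func withPodSlot(req ResourceList, respectNodePodLimits bool) ResourceList {
	out := make(ResourceList, len(req)+1)
	for name, qty := range req {
		out[name] = qty
	}
	if respectNodePodLimits {
		out["pods"] = 1
	}
	return out
}

// fits reports whether every requested quantity is still free on the node.
// Free capacity is allocatable minus non-Armada usage minus Armada usage.
func fits(req, allocatable, nonArmada, armada ResourceList) bool {
	for name, qty := range req {
		free := allocatable[name] - nonArmada[name] - armada[name]
		if qty > free {
			return false
		}
	}
	return true
}

func main() {
	allocatable := ResourceList{"cpu": 32000, "pods": 110}
	nonArmada := ResourceList{"cpu": 2000, "pods": 20} // DaemonSets + system pods
	armada := ResourceList{"cpu": 9000, "pods": 90}    // 90 small jobs already bound
	job := ResourceList{"cpu": 100}

	// Flag off: only cpu is checked, so the job appears to fit.
	fmt.Println(fits(withPodSlot(job, false), allocatable, nonArmada, armada)) // true
	// Flag on: 110 - 20 - 90 = 0 free pod slots, so the job is rejected.
	fmt.Println(fits(withPodSlot(job, true), allocatable, nonArmada, armada)) // false
}
```

The example node has plenty of CPU left but no pod slots, which is the situation the flag is meant to catch.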

Fixes #4515

This PR builds on #4517. Big thanks to @Sovietaced for the initial work.

Operator upgrade notes

Executor change is unconditional. After the executor upgrade, NonArmadaAllocatedResources gains a pods key in every report regardless of whether any scheduler has the flag enabled. Dashboards, metrics, or custom consumers that iterate this map generically (e.g. sum over all keys) will start including pod counts. Audit Prometheus / Grafana panels before rollout.
Rollback is clean. Reverting the scheduler flag to false stops the scheduler from tracking pods; reverting the executor binary removes the pods key from its reports. Neither requires data migration.
Rolling upgrade order is flexible. Old scheduler + new executor is safe (scheduler's FromNodeProto silently drops unknown resources). New scheduler + old executor briefly overestimates free pod capacity by the count of non-Armada pods per node (~10-30 typically, DaemonSets + system pods), since the old executor does not report them. The overestimate resolves as executors are upgraded.
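
The two mixed-version cases above can be illustrated with a small Go sketch. It is a simplified model under stated assumptions: `filterSupported` and `freePodSlots` are hypothetical helpers, not functions from the Armada codebase, and the numbers are made up for the example.

```go
package main

import "fmt"

// filterSupported keeps only the resources the scheduler has registered.
// It models an old scheduler ignoring the new "pods" key in executor reports.
func filterSupported(reported map[string]int64, supported map[string]bool) map[string]int64 {
	out := make(map[string]int64, len(reported))
	for name, qty := range reported {
		if supported[name] {
			out[name] = qty
		}
	}
	return out
}

// freePodSlots is the pod capacity the scheduler believes is available.
func freePodSlots(allocatablePods, reportedNonArmadaPods, armadaPods int64) int64 {
	return allocatablePods - reportedNonArmadaPods - armadaPods
}

func main() {
	// Old scheduler + new executor: the pods key is dropped, nothing else changes.
	report := map[string]int64{"cpu": 2000, "memory": 4 << 30, "pods": 20}
	oldSupported := map[string]bool{"cpu": true, "memory": true}
	filtered := filterSupported(report, oldSupported)
	_, hasPods := filtered["pods"]
	fmt.Println(len(filtered), hasPods) // 2 false

	// New scheduler + old executor: the old executor reports no non-Armada pods,
	// so the scheduler overestimates free slots by exactly that count.
	actual := freePodSlots(110, 20, 50)   // 40 slots really free
	believed := freePodSlots(110, 0, 50)  // 60 slots from the scheduler's view
	fmt.Println(believed - actual)        // 20
}
```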

Known limitations

pods is not added to dominantResourceFairnessResourcesToConsider. On dense-pod nodes (e.g. GKE's 110-pod limit) a queue running many small pods can monopolize pod slots without a fair-share penalty. Deferred per reviewer request; follow-up if this becomes a problem in practice.
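
To make the limitation concrete, here is a textbook-style dominant-share calculation in Go. It is only a sketch: `dominantShare` is a hypothetical function and is not Armada's actual fair-share cost function. The numbers assume one node-sized pool with 32 CPUs, 128 GiB of memory, and 110 pod slots.

```go
package main

import "fmt"

// dominantShare returns the largest allocated/total ratio over the
// resources listed in consider. Resources not listed are ignored.
func dominantShare(allocated, total map[string]float64, consider []string) float64 {
	share := 0.0
	for _, name := range consider {
		if total[name] <= 0 {
			continue
		}
		if s := allocated[name] / total[name]; s > share {
			share = s
		}
	}
	return share
}

func main() {
	total := map[string]float64{"cpu": 32, "memory": 128, "pods": 110}
	// A queue running 100 small pods: 0.1 CPU and 0.1 GiB each.
	queue := map[string]float64{"cpu": 10, "memory": 10, "pods": 100}

	// Without pods in the considered set the queue looks modest (10/32).
	fmt.Println(dominantShare(queue, total, []string{"cpu", "memory"})) // 0.3125
	// With pods considered, the same queue holds most of the pod slots (100/110).
	fmt.Println(dominantShare(queue, total, []string{"cpu", "memory", "pods"}) > 0.9) // true
}
```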

greptile-apps · 2026-04-16T11:37:15Z

Greptile Summary

This PR adds a respectNodePodLimits scheduler feature flag (default false) that tracks Kubernetes per-node pod capacity as a first-class scheduling constraint. When enabled, the scheduler registers pods in SupportedResourceTypes and IndexedResources at startup and injects pods: 1 into every job's resource requirements via JobDb.getResourceRequirements. The executor unconditionally adds pods: 1 to NonArmadaAllocatedResources for each non-Armada pod so the scheduler can subtract system/DaemonSet pod consumption from allocatable capacity; this is a safe no-op when the flag is off.
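
The executor-side behavior described above can be sketched as follows. This is an illustration under assumptions: the `Pod` struct and `nonArmadaAllocated` are hypothetical, and the real code groups allocations by priority as well, which is omitted here.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a Kubernetes pod.
type Pod struct {
	Name            string
	ManagedByArmada bool
	Requests        map[string]int64 // e.g. "cpu" in millicores
}

// nonArmadaAllocated sums the requests of pods not managed by Armada and
// counts one "pods" unit for each of them.
func nonArmadaAllocated(pods []Pod) map[string]int64 {
	total := map[string]int64{}
	for _, p := range pods {
		if p.ManagedByArmada {
			continue
		}
		for name, qty := range p.Requests {
			total[name] += qty
		}
		total["pods"]++
	}
	return total
}

func main() {
	pods := []Pod{
		{Name: "kube-proxy", Requests: map[string]int64{"cpu": 100}},
		{Name: "node-exporter", Requests: map[string]int64{"cpu": 50}},
		{Name: "armada-job-1", ManagedByArmada: true, Requests: map[string]int64{"cpu": 4000}},
	}
	got := nonArmadaAllocated(pods)
	fmt.Println(got["cpu"], got["pods"]) // 150 2
}
```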

The implementation is well-structured with good test coverage across configuration, jobdb, nodedb matching, and an end-to-end preempting-scheduler integration test.

Confidence Score: 5/5

This PR is safe to merge; no P0 or P1 issues found.

All changed code paths are correct: safeGetRequirements returns a fresh map so mutation is safe, ApplyRespectNodePodLimits is idempotent and called before ResourceListFactory construction, Clone correctly propagates the flag, and the executor's unconditional pods injection is silently dropped by the scheduler when the feature is off. Comprehensive test coverage spans unit, lifecycle, matching, and integration levels.
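
The idempotency property mentioned above can be shown with a minimal sketch. The types here are hypothetical simplifications (the real configuration uses richer quantity types), and `applyRespectNodePodLimits` only mirrors the described behavior: normalise the pods resolution to 1 and register pods in both lists, without duplicating entries on repeated calls.

```go
package main

import "fmt"

// ResourceType is a hypothetical, simplified resource registration entry.
type ResourceType struct {
	Name       string
	Resolution int64
}

// SchedulingConfig is a hypothetical subset of the scheduler configuration.
type SchedulingConfig struct {
	RespectNodePodLimits   bool
	SupportedResourceTypes []ResourceType
	IndexedResources       []ResourceType
}

// upsertPods sets the pods resolution to 1 if pods is present,
// otherwise appends a pods entry with resolution 1.
func upsertPods(list []ResourceType) []ResourceType {
	for i := range list {
		if list[i].Name == "pods" {
			list[i].Resolution = 1
			return list
		}
	}
	return append(list, ResourceType{Name: "pods", Resolution: 1})
}

// applyRespectNodePodLimits is safe to call any number of times.
func applyRespectNodePodLimits(c *SchedulingConfig) {
	if !c.RespectNodePodLimits {
		return
	}
	c.SupportedResourceTypes = upsertPods(c.SupportedResourceTypes)
	c.IndexedResources = upsertPods(c.IndexedResources)
}

func main() {
	cfg := &SchedulingConfig{
		RespectNodePodLimits:   true,
		SupportedResourceTypes: []ResourceType{{Name: "cpu", Resolution: 1}, {Name: "pods", Resolution: 10}},
		IndexedResources:       []ResourceType{{Name: "cpu", Resolution: 1}},
	}
	applyRespectNodePodLimits(cfg)
	applyRespectNodePodLimits(cfg) // second call changes nothing
	fmt.Println(len(cfg.SupportedResourceTypes), len(cfg.IndexedResources)) // 2 2
	fmt.Println(cfg.SupportedResourceTypes[1].Resolution)                   // 1
}
```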

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/scheduler/configuration/configuration.go | Adds RespectNodePodLimits flag and ApplyRespectNodePodLimits helper that normalises the pods resolution to 1 and registers it in both SupportedResourceTypes and IndexedResources; idempotent and safe. |
| internal/scheduler/jobdb/jobdb.go | Injects pods: 1 into each job's resource requirements when the flag is on; the map returned by safeGetRequirements -> K8sResourceListToMap is always a fresh copy so mutation is safe; Clone correctly propagates the flag. |
| internal/executor/utilisation/cluster_utilisation.go | Unconditionally injects pods: 1 per non-Armada pod into NonArmadaAllocatedResources; correctly silenced by the scheduler's ResourceListFactory when the flag is off. |
| internal/scheduler/schedulerapp.go | ApplyRespectNodePodLimits is called before ResourceListFactory construction and SetRespectNodePodLimits is called before the scheduler starts, so no jobs are created in the window between the two calls. |
| internal/scheduler/scheduling/preempting_queue_scheduler_test.go | Large end-to-end integration test covering single-job scheduling, gang atomic rejection, preemption to free pod slots, and cross-node placement preference. |
| internal/scheduler/nodedb/nodedb_test.go | New TestNodeBindingEvictionUnbinding_ReleasesPodSlot test verifies the bind->evict->unbind lifecycle correctly releases pod slots; tests re-bind to prove the freed slot is usable. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Scheduler startup] --> B{RespectNodePodLimits?}
    B -- false --> C[Normal scheduling\nno pod-slot tracking]
    B -- true --> D[ApplyRespectNodePodLimits\nregisters pods in\nSupportedResourceTypes\n+ IndexedResources]
    D --> E[ResourceListFactory\ntracks pods dimension]
    E --> F[JobDb.SetRespectNodePodLimits\ntrue]

    subgraph Executor
        G[Non-Armada pod scan] --> H[allocatedByPriorityAndResourceTypeFromPods\ninjects pods:1 per pod]
        H --> I[NonArmadaAllocatedResources\nreported to scheduler]
    end

    subgraph Scheduler per-job
        F --> J[JobDb.getResourceRequirements\ninjects pods:1 into job requirements]
        J --> K[NodeDb capacity check\npods consumed on bind\nfreed on evict+unbind]
    end

    I --> L{Scheduler has pods\nin SupportedResourceTypes?}
    L -- yes --> K
    L -- no --> M[pods field silently dropped\nby ResourceListFactory]

Reviews (18): Last reviewed commit: "Merge branch 'master' into respect-node-..."

dejanzele · 2026-04-17T13:58:15Z

@greptileai

dejanzele · 2026-04-24T13:37:29Z

@greptileai

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

Sovietaced · 2026-05-01T00:28:49Z

Thanks for landing this. We'll try it out at some point

…n-preemptible (#4879)

#### What type of PR is this?

/kind bug
/kind cleanup

#### What this PR does / why we need it

The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job shows up. The higher-priority job lands on the node anyway, putting it over its declared capacity. The same gap is there for cpu, memory, and pods.

Three pieces are involved, each one defensible on its own:

1. `MarkAllocated(p, rs)` in `internaltypes/resource_list_map_util.go:67` only deducts allocatable from priorities `<= p`. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to."
2. The rebalance eviction phase (`preempting_queue_scheduler.go:118`) skips non-preemptible jobs.
3. The OversubscribedEvictor (`eviction.go:164`) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do.

For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays.

The fix lives in `nodedb/nodedb.go`. The bind/unbind/evict paths now compute a `priorityCutoffFor(job, scheduledPriority)`: preemptible jobs use their scheduled priority as the cutoff (existing behavior), non-preemptible jobs use a sentinel `nonPreemptibleCutoff = math.MaxInt32` so the existing `markAllocated`/`markAllocatable` helpers deduct (or release) at every real priority. Once `AllocatableByPriority` reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes.

The PR has two commits so the bug and the fix are easy to see separately:

- `Reproducer:` adds `TestPreemptingQueueScheduler_NonPreemptibleOverPack`. It uses cpu rather than pods so the assertion is on the priority model itself. Run against this commit alone, the test fails.
- `Deduct non-preemptible...` is the fix. The reproducer now passes and nothing else in the suite needed touching, except `TestEviction`, which had hardcoded expected values reflecting the pre-fix accounting at high priorities. Updated.

#### Which issue(s) this PR fixes

Fixes #

#### Special notes for your reviewer

A few things to flag:

- I found this while validating #4841 (the `respectNodePodLimits` flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it just surfaces this older scheduler-wide issue, which is what's being addressed here at its real root.
- Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of `<= p`. Invisible at scale.
- Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern.

I ran the full test suite locally across `internal/scheduler/...`, `internal/executor/...`, `internal/server/...`, and `internal/scheduleringester`. Everything passes.

---------

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
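
The priority-cutoff idea in the commit message above can be sketched in Go. This is a simplified model, not the code in `nodedb/nodedb.go`: it tracks a single resource, and `priorityCutoffFor` here takes a boolean instead of a job, which is an assumption made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"math"
)

// nonPreemptibleCutoff is the sentinel named in the commit message.
const nonPreemptibleCutoff = math.MaxInt32

// AllocatableByPriority maps a priority level to the quantity of one
// resource that a job scheduled at that priority may still claim.
type AllocatableByPriority map[int32]int64

// markAllocated deducts qty from every priority level at or below cutoff.
func (a AllocatableByPriority) markAllocated(cutoff int32, qty int64) {
	for p := range a {
		if p <= cutoff {
			a[p] -= qty
		}
	}
}

// priorityCutoffFor returns the scheduled priority for preemptible jobs
// and the sentinel for non-preemptible ones (simplified signature).
func priorityCutoffFor(preemptible bool, scheduledPriority int32) int32 {
	if preemptible {
		return scheduledPriority
	}
	return nonPreemptibleCutoff
}

func main() {
	// Preemptible job at priority 0: higher priorities still see 4 CPUs,
	// because the incumbent could be preempted.
	a := AllocatableByPriority{0: 4, 1000: 4, 2000: 4}
	a.markAllocated(priorityCutoffFor(true, 0), 4)
	fmt.Println(a[0], a[1000], a[2000]) // 0 4 4

	// Non-preemptible job at priority 0: every level sees the node as full,
	// so a higher-priority job can no longer over-pack it.
	b := AllocatableByPriority{0: 4, 1000: 4, 2000: 4}
	b.markAllocated(priorityCutoffFor(false, 0), 4)
	fmt.Println(b[0], b[1000], b[2000]) // 0 0 0
}
```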

greptile-apps Bot reviewed Apr 16, 2026

Comment thread internal/scheduler/jobdb/jobdb_test.go Outdated

Comment thread internal/executor/utilisation/cluster_utilisation.go Outdated

dejanzele force-pushed the respect-node-pod-limits branch 8 times, most recently from 4faa4a9 to 57a9176 Compare April 17, 2026 13:54

dejanzele force-pushed the respect-node-pod-limits branch 7 times, most recently from 2008bd1 to ef28b83 Compare April 24, 2026 13:36

dejanzele mentioned this pull request Apr 24, 2026

fix: update armada to respect node pod limits #4517

Closed

JamesMurkin previously approved these changes Apr 24, 2026

dejanzele dismissed JamesMurkin’s stale review via 257b292 April 27, 2026 10:59

dejanzele force-pushed the respect-node-pod-limits branch from ef28b83 to 257b292 Compare April 27, 2026 10:59

dejanzele mentioned this pull request Apr 27, 2026

fix: scheduler over-packs nodes when lower-priority incumbents are non-preemptible #4879

Merged

dejanzele force-pushed the respect-node-pod-limits branch 2 times, most recently from 58fb73a to 4110640 Compare April 27, 2026 12:42

Add respectNodePodLimits scheduler flag to enforce per-node pod capacity

70b5707

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele force-pushed the respect-node-pod-limits branch from 4110640 to 70b5707 Compare April 27, 2026 12:56

JamesMurkin approved these changes Apr 27, 2026

Merge branch 'master' into respect-node-pod-limits

9045fbe

dejanzele enabled auto-merge (squash) April 27, 2026 15:05

dejanzele merged commit 0f9932b into armadaproject:master Apr 27, 2026
17 checks passed

dejanzele merged 2 commits into armadaproject:master from dejanzele:respect-node-pod-limits
